Week 04
Data Visualisation

SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research


Semester 1, 2026
Last updated: 2026-01-22

Francesco Bailo

Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

Learning Objectives

By the end of this week, you will be able to:

  • Create effective visualisations with ggplot2
  • Apply principles of good graphics
  • Choose appropriate plot types for different data
  • Compare distributions and identify relationships

This Week’s Readings

TSwD

  • Ch 5: Static communication
    • 5.2 Graphs (bar charts, scatterplots, line plots, histograms, boxplots)

ROS

  • Ch 2.3-2.5: All graphs are comparisons, Data and adjustment

Why Visualise Data?

Always Plot Your Data

The Golden Rule

Summary statistics alone can be misleading. Always visualise your data before drawing conclusions.

“A world turning to a saner and richer civilisation will be a world turning to charts.” — Karsten (1923)

The Datasaurus Dozen

Consider these four datasets with nearly identical summary statistics:

Dataset     x mean   x sd   y mean   y sd
dino          54.3   16.8     47.8   26.9
away          54.3   16.8     47.8   26.9
star          54.3   16.8     47.8   26.9
bullseye      54.3   16.8     47.8   26.9

What do they actually look like?

The Datasaurus Dozen: Revealed
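
The figure for this slide is not reproduced in this text. As a rough sketch, the four datasets from the table can be drawn with the code below. It assumes the datasauRus package is installed (the slides do not name a package). That package provides a datasaurus_dozen data frame with columns dataset, x and y. The object names four and p_datasaurus are only illustrative.

```r
library(ggplot2)

# Assumption: the datasauRus package is installed; otherwise print a hint.
if (requireNamespace("datasauRus", quietly = TRUE)) {
  dozen <- datasauRus::datasaurus_dozen
  four <- dozen[dozen$dataset %in% c("dino", "away", "star", "bullseye"), ]

  # One panel per dataset, on shared axes so the shapes are comparable
  p_datasaurus <- ggplot(four, aes(x = x, y = y)) +
    geom_point(alpha = 0.6) +
    facet_wrap(vars(dataset), nrow = 1) +
    theme_minimal()
  print(p_datasaurus)
} else {
  message("Install datasauRus first: install.packages('datasauRus')")
}
```

Because the panels share both axes, any difference in shape is immediately visible even though the summary statistics match.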

Anscombe’s Quartet

The same lesson from the statistician Frank Anscombe:
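
Anscombe's quartet ships with base R as the anscombe data frame (columns x1 to x4 and y1 to y4), so the lesson can be checked directly. The sketch below computes the summary statistics for each of the four sets and then plots them. The helper names quartet_stats and col_of are illustrative.

```r
# Helper: pull column "x1", "y3", ... out of the built-in anscombe data
col_of <- function(prefix, i) anscombe[[paste0(prefix, i)]]

# Summary statistics for each of the four sets
quartet_stats <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(col_of("x", i))),
  mean_y = sapply(1:4, function(i) mean(col_of("y", i))),
  sd_x   = sapply(1:4, function(i) sd(col_of("x", i))),
  sd_y   = sapply(1:4, function(i) sd(col_of("y", i))),
  cor_xy = sapply(1:4, function(i) cor(col_of("x", i), col_of("y", i)))
)
print(round(quartet_stats, 2))

# The plots tell a very different story from the table
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(col_of("x", i), col_of("y", i),
       xlab = "x", ylab = "y", main = paste("Set", i))
}
par(old_par)
```

The four rows of the table are nearly indistinguishable after rounding, while the four panels are clearly not.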

All Graphs Are Comparisons

Graphs Enable Comparison

Key Insight from ROS

Every graph is fundamentally a comparison: to zero, to a reference line, to other data points, or to our expectations.

When making graphs, we should:

  • Line things up so the most important comparisons are clearest
  • Align scales, because comparisons are easiest to read on a shared scale
  • Consider what the reader needs to see

Example: Health Spending and Life Expectancy

Display More Information

A scatterplot can display up to five variables easily:

  1. x position
  2. y position
  3. Symbol shape
  4. Symbol size
  5. Symbol colour

A grid of plots adds two more dimensions!
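
As an illustration of the list above, the sketch below maps five variables from the mpg data onto one scatterplot and then splits it into panels by a sixth. The choice of variables is only an example, and a plot this dense is rarely the clearest option in practice.

```r
library(ggplot2)

p_five <- ggplot(
  mpg,
  aes(
    x      = displ,        # 1. x position
    y      = hwy,          # 2. y position
    shape  = drv,          # 3. symbol shape
    size   = cty,          # 4. symbol size
    colour = factor(cyl)   # 5. symbol colour
  )
) +
  geom_point(alpha = 0.6) +
  facet_grid(cols = vars(year)) +  # one of the two grid dimensions
  theme_minimal() +
  labs(
    x = "Engine displacement (L)",
    y = "Highway MPG",
    shape = "Drive type",
    size = "City MPG",
    colour = "Cylinders"
  )
p_five
```

Adding rows = vars(...) to facet_grid() would use the second grid dimension as well.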

The Grammar of Graphics

ggplot2: A Layered Approach

ggplot2 implements the grammar of graphics:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)) +
  <ADDITIONAL_LAYERS>

Key Components

  • Data: Your dataset
  • Aesthetics: Visual mappings
  • Geoms: Geometric objects
  • Scales: Axis and colour scales
  • Facets: Subplots
  • Themes: Overall appearance
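
To make the template concrete, here is a minimal sketch that fills in each component once, using the mpg data that ships with ggplot2. The particular scale, facet variable and theme are arbitrary choices for illustration.

```r
library(ggplot2)

p_components <- ggplot(data = mpg) +                             # Data
  geom_point(mapping = aes(x = displ, y = hwy, colour = drv)) +  # Geom + aesthetics
  scale_colour_brewer(palette = "Dark2") +                       # Scale
  facet_wrap(vars(year)) +                                       # Facets
  theme_minimal()                                                # Theme
p_components
```

Each line after the first is an independent layer or setting, so any of them can be removed or swapped without touching the others.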

Common Geoms

  • geom_bar(): Bar charts
  • geom_point(): Scatterplots
  • geom_line(): Line plots
  • geom_histogram(): Histograms
  • geom_boxplot(): Boxplots

Bar Charts

When to Use Bar Charts

Bar charts are ideal when you have a categorical variable that you want to focus on.

library(tidyverse) # loads ggplot2 (plotting) and dplyr (count(), filter())

# Using the mpg dataset that ships with ggplot2
mpg |>
  ggplot(aes(x = class)) +
  geom_bar() +
  theme_minimal() +
  labs(
    x = "Vehicle class",
    y = "Count"
  )

Bar Charts with Colour

Add a second variable using fill:

mpg |>
  ggplot(aes(x = class, fill = drv)) +
  geom_bar() +
  theme_minimal() +
  labs(
    x = "Vehicle class",
    y = "Count",
    fill = "Drive type"
  ) +
  theme(legend.position = "bottom")

Side-by-Side Bars

Use position = "dodge2" for side-by-side comparison:

mpg |>
  ggplot(aes(x = class, fill = drv)) +
  geom_bar(position = "dodge2") +
  theme_minimal() +
  labs(
    x = "Vehicle class",
    y = "Count",
    fill = "Drive type"
  ) +
  theme(legend.position = "bottom")

geom_bar() vs geom_col()

geom_bar()

  • Counts observations automatically
  • Use when you have raw data

mpg |>
  ggplot(aes(x = class)) +
  geom_bar()

geom_col()

  • Uses values you provide
  • Use when you have pre-computed counts

mpg |>
  count(class) |>
  ggplot(aes(x = class, y = n)) +
  geom_col()
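
One way to convince yourself that the two geoms draw the same bars is to compare the heights ggplot2 computes for each. The sketch below uses layer_data(), which returns the data frame behind a plot layer. The object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# Pre-computed counts, as geom_col() expects
class_counts <- mpg |> count(class)

p_bar <- ggplot(mpg, aes(x = class)) + geom_bar()
p_col <- ggplot(class_counts, aes(x = class, y = n)) + geom_col()

# Bar heights as ggplot2 sees them
bar_heights <- as.numeric(layer_data(p_bar)$y)
col_heights <- as.numeric(layer_data(p_col)$y)

all.equal(bar_heights, col_heights)
```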

Scatterplots

When to Use Scatterplots

Scatterplots show the relationship between two continuous variables.

Expert Advice

“A scatterplot may not always be the best choice, but it is rarely a bad one.” — Weissgerber et al. (2015)

Some consider it the most versatile and useful graph option.

Basic Scatterplot

mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Engine displacement (L)",
    y = "Highway MPG"
  )

Adding Colour

mpg |>
  ggplot(aes(x = displ, y = hwy, 
             colour = class)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(
    x = "Engine displacement (L)",
    y = "Highway MPG",
    colour = "Vehicle class"
  ) +
  theme(legend.position = "bottom")

Handling Overlapping Points

Two strategies for overlapping points:

Transparency (alpha)

mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

Jitter

mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_jitter(width = 0.2, height = 0.5) +
  theme_minimal()
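
To see why overlap matters for this dataset, the sketch below counts how many rows of mpg fall exactly on a (displ, hwy) position that an earlier row already occupies, and then combines both remedies in one plot. The jitter amounts and the seed are arbitrary choices.

```r
library(ggplot2)

# Rows whose (displ, hwy) pair already appeared earlier in the data:
# in a plain scatterplot these are drawn exactly on top of another point
n_hidden <- sum(duplicated(mpg[, c("displ", "hwy")]))
n_hidden
nrow(mpg)

# Transparency and a small jitter can be combined;
# set.seed() makes the random jitter reproducible
set.seed(4102)
p_overlap <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_jitter(width = 0.05, height = 0.3, alpha = 0.5) +
  theme_minimal()
p_overlap
```

Jitter changes the plotted positions, so keep it small relative to the spacing of the data and mention it in the caption.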

Adding a Trend Line

mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", 
              colour = "red",
              se = TRUE) +
  theme_minimal() +
  labs(
    x = "Engine displacement (L)",
    y = "Highway MPG"
  )
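
With method = "lm", geom_smooth() draws the least-squares line of y on x, and se = TRUE adds a confidence band around it (95% by default). The same fit can be inspected numerically with lm(). The sketch below is a companion to the plot, and the displacement values used for prediction are arbitrary.

```r
library(ggplot2)  # only needed here for the mpg data

# The same straight line that geom_smooth(method = "lm") draws
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)  # intercept and slope

# Fitted values with 95% confidence limits at a few engine sizes
new_cars <- data.frame(displ = c(2, 4, 6))
predict(fit, newdata = new_cars, interval = "confidence")
```

A negative slope here means larger engines are associated with lower highway fuel economy in this sample.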

Line Plots

When to Use Line Plots

Line plots are ideal when data points should be connected, typically for:

  • Time series data
  • Sequential measurements
  • Continuous processes

Basic Line Plot

economics |>
  ggplot(aes(x = date, y = unemploy)) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Date",
    y = "Unemployment (thousands)",
    caption = "Data source: US Bureau of Labor Statistics"
  )

Multiple Lines

economics_long |>
  filter(variable %in% c("psavert", "uempmed")) |>
  ggplot(aes(x = date, y = value, 
             colour = variable)) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Date",
    y = "Value",
    colour = "Variable"
  ) +
  theme(legend.position = "bottom")

Step Plots

Use geom_step() to emphasise discrete changes:

economics |>
  filter(date >= as.Date("2005-01-01"),
         date <= as.Date("2010-01-01")) |>
  ggplot(aes(x = date, y = unemploy)) +
  geom_step() +
  theme_minimal() +
  labs(
    x = "Date",
    y = "Unemployment (thousands)"
  )

Histograms

When to Use Histograms

Histograms show the distribution of a continuous variable by:

  1. Splitting the range into “bins”
  2. Counting observations in each bin
  3. Displaying as bars
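
The three steps can be carried out by hand in base R, which makes clear what a histogram does behind the scenes. In this sketch the bin width of 5 mpg and the break points are arbitrary choices.

```r
library(ggplot2)  # only needed here for the mpg data

# 1. Split the range into bins (5 mpg wide, covering all hwy values)
breaks <- seq(10, 45, by = 5)
bins <- cut(mpg$hwy, breaks = breaks)

# 2. Count observations in each bin
bin_counts <- table(bins)
bin_counts

# 3. Display the counts as bars
barplot(bin_counts, xlab = "Highway MPG", ylab = "Count")
```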

Basic Histogram

mpg |>
  ggplot(aes(x = hwy)) +
  geom_histogram() + # default is 30 bins; ggplot2 prints a message suggesting you choose your own
  theme_minimal() +
  labs(
    x = "Highway MPG",
    y = "Count"
  )

Choosing Bins

The number of bins affects interpretation:

mpg |>
  ggplot(aes(x = hwy)) +
  geom_histogram(bins = 10) +
  theme_minimal() +
  labs(title = "10 bins")

mpg |>
  ggplot(aes(x = hwy)) +
  geom_histogram(bins = 30) +
  theme_minimal() +
  labs(title = "30 bins")

Note

Too few bins = too much smoothing. Too many bins = too much noise.

Comparing Distributions

Use geom_freqpoly() to compare groups:

mpg |>
  ggplot(aes(x = hwy, colour = drv)) +
  geom_freqpoly(binwidth = 2) +
  theme_minimal() +
  labs(
    x = "Highway MPG",
    y = "Count",
    colour = "Drive type"
  ) +
  theme(legend.position = "bottom")

Empirical Cumulative Distribution

An alternative view with stat_ecdf():

mpg |>
  ggplot(aes(x = hwy, colour = drv)) +
  stat_ecdf() +
  theme_minimal() +
  labs(
    x = "Highway MPG",
    y = "Cumulative proportion",
    colour = "Drive type"
  ) +
  theme(legend.position = "bottom")
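
Each curve in an ECDF plot answers the question "what proportion of observations are at or below this value?". The sketch below computes one such value with base R's ecdf() for the four-wheel-drive cars. The cut-off of 20 mpg is an arbitrary example.

```r
library(ggplot2)  # only needed here for the mpg data

hwy_4wd <- mpg$hwy[mpg$drv == "4"]

# ecdf() returns a function: give it a value, get back a proportion
ecdf_4wd <- ecdf(hwy_4wd)
ecdf_4wd(20)

# The same quantity computed by hand
mean(hwy_4wd <= 20)
```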

Boxplots

What Boxplots Show

A boxplot displays five key statistics:

  1. Median (middle line)
  2. 25th percentile (bottom of box)
  3. 75th percentile (top of box)
  4. Whiskers (extend to the most extreme observations within 1.5 × IQR of the box edges)
  5. Outliers (points beyond whiskers)
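
The five elements above can be computed by hand, which helps when reading a boxplot. The sketch below does so for highway MPG across all cars in mpg, using quantile() for the box edges. The variable names are illustrative.

```r
library(ggplot2)  # only needed here for the mpg data

hwy <- mpg$hwy

# Box: 25th percentile, median, 75th percentile
q <- quantile(hwy, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences: 1.5 x IQR beyond the box edges
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
whisker_low  <- min(hwy[hwy >= lower_fence])
whisker_high <- max(hwy[hwy <= upper_fence])

# Outliers are the observations beyond the fences
outliers <- hwy[hwy < lower_fence | hwy > upper_fence]

q
c(whisker_low = whisker_low, whisker_high = whisker_high)
outliers
```

Note that the whiskers end at actual data values, so they are usually shorter than 1.5 × IQR.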

Basic Boxplot

mpg |>
  ggplot(aes(x = class, y = hwy)) +
  geom_boxplot() +
  theme_minimal() +
  labs(
    x = "Vehicle class",
    y = "Highway MPG"
  )

Boxplots Hide Distribution Shape!

Warning

The same boxplot can represent very different distributions!

Boxplots + Points

Show the actual data alongside summary statistics:

mpg |>
  ggplot(aes(x = class, y = hwy)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.3, 
              width = 0.2) +
  theme_minimal() +
  labs(
    x = "Vehicle class",
    y = "Highway MPG"
  )

Customising Your Plots

Themes

ggplot2 includes several built-in themes:
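
The figure comparing themes is not reproduced in this text. As a sketch, the code below applies four of the themes that ship with ggplot2 to the same scatterplot. The selection of themes and the list name themed are illustrative.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# The same plot under four built-in themes
themed <- list(
  grey    = base_plot + theme_grey()    + labs(title = "theme_grey() (default)"),
  bw      = base_plot + theme_bw()      + labs(title = "theme_bw()"),
  minimal = base_plot + theme_minimal() + labs(title = "theme_minimal()"),
  classic = base_plot + theme_classic() + labs(title = "theme_classic()")
)

# Draw them one after another
for (p in themed) print(p)
```

A theme changes only non-data elements such as background, grid lines and fonts. The points themselves are identical in all four plots.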

Labels and Titles

Use labs() to add context:

mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  theme_minimal() +
  labs(
    title = "Fuel Efficiency vs Engine Size",
    subtitle = "Data from 1999 and 2008",
    x = "Engine displacement (litres)",
    y = "Highway fuel economy (mpg)",
    caption = "Source: EPA fuel economy data"
  )

Facets: Small Multiples

Split your plot by a categorical variable:

mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(vars(drv)) +
  theme_minimal() +
  labs(
    x = "Engine displacement (L)",
    y = "Highway MPG"
  )

Facet Grid

Create a two-dimensional grid of panels:

mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(rows = vars(drv),
             cols = vars(cyl)) +
  theme_minimal() +
  labs(
    x = "Engine displacement (L)",
    y = "Highway MPG"
  )

Colour Palettes

RColorBrewer

mpg |>
  ggplot(aes(x = displ, y = hwy, 
             colour = class)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1") +
  theme_minimal()

Viridis (colour-blind friendly)

mpg |>
  ggplot(aes(x = displ, y = hwy, 
             colour = class)) +
  geom_point() +
  scale_colour_viridis_d() +
  theme_minimal()

Combining Plots

The patchwork Package

Combine multiple plots with patchwork:

library(patchwork)

p1 <- mpg |> ggplot(aes(x = class)) + 
  geom_bar() + theme_minimal()

p2 <- mpg |> ggplot(aes(x = hwy)) + 
  geom_histogram(bins = 20) + theme_minimal()

p1 + p2

Complex Layouts

p1 <- mpg |> ggplot(aes(x = class)) + geom_bar() + theme_minimal()
p2 <- mpg |> ggplot(aes(x = hwy)) + geom_histogram(bins = 20) + theme_minimal()
p3 <- mpg |> ggplot(aes(x = displ, y = hwy)) + geom_point() + theme_minimal()

(p1 + p2) / p3

Principles of Effective Visualisation

Key Principles

Design for Your Audience

The success of a graph depends on how much information is lost in the encoding-decoding process.

  1. Show the actual data — not just summaries
  2. Make comparisons easy — align scales, use consistent colours
  3. Avoid unnecessary decoration — every element should serve a purpose
  4. Consider your audience — match complexity to their expertise
  5. Label clearly — include titles, axis labels, and sources

What to Avoid

Don’t

  • Use 3D effects
  • Truncate axes misleadingly
  • Use pie charts for comparison
  • Add unnecessary colour
  • Forget labels and sources

Do

  • Keep it simple
  • Start y-axis at zero (usually)
  • Use bar/dot plots for comparison
  • Use colour meaningfully
  • Document everything

Significant Digits

Reporting Numbers

Don’t report numbers to too many decimal places. Display precision that respects the uncertainty in your data.

  • [3.276, 6.410] → Better written as [3.3, 6.4]
  • Three significant digits are usually enough
  • In R: options(digits = 2) sets how many significant digits are printed by default; it does not round the stored values
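
For reporting individual numbers, rounding explicitly is often clearer than changing a global option. The sketch below applies base R's signif(), round() and format() to the interval from the example above. The variable name interval is illustrative.

```r
interval <- c(3.276, 6.410)

signif(interval, digits = 2)  # two significant digits: 3.3 6.4
round(interval, digits = 1)   # one decimal place: 3.3 6.4
format(interval, digits = 2)  # character strings for text and tables

# None of these calls modifies the stored values
interval
```

signif() and round() return numbers, while format() returns text, so format() belongs at the very last step before display.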

Summary

This Week’s Key Points

  1. Always plot your data — summary statistics can be misleading

  2. All graphs are comparisons — design to make key comparisons clear

  3. Choose appropriate geoms:

    • Bar charts for categorical counts
    • Scatterplots for two continuous variables
    • Line plots for connected/sequential data
    • Histograms for distributions
    • Boxplots (+ points) for comparing distributions
  4. Customise thoughtfully — themes, labels, colours, and facets

Next Week

Week 5: Data Cleaning and Probability Simulation

  • Systematic data cleaning workflow
  • Writing tests for data quality
  • Understanding probability distributions
  • Simulating data for validation

Readings

  • TSwD Ch 9: Clean and prepare
  • ROS Ch 3-5: Probability and simulation

References